Lab 4: Extending Logistic Regression

Cameron Matson

Zihao Mao

In [107]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import os

Data overview and Buisness Explanation

In this lab we aim to predict the position of an NBA player given some set of descriptors and statistics of their play. The original data set contains the player information and stats from every player that played in the NBA from 1950 to present.

Professional sports teams in general, and basketball teams in particular, have turned in recent years to data to gain an edge over their opponents. Part of the responsibilities of the team ownership and coaching staff are to assemble a team of players that give them best chance to win. Players on an NBA team have specific roles, largely based on the position that they play. It is, therefore, important that player play in the posision that most helps their team to win. We would like to create a classifier that would help NBA teams make decisions on which position players should play, which may or may not necessarilly be the position they have actually played. This could especially be the case for players coming from college, where teams generally run systems quite different from that employed in the NBA.

What our classifier will do, then is look at the stats and player details for each position and then, given a new set of stats and player make a probalistic estimation of which position that player plays. If the result diverges from the players listed position, this might be an indication that the player is not playing the correct position.

With respect to accuracy, its important to keep in mind that our classifier is meant to aid basketball personel decisions, not make them authoritatively. It's primary use is as a tool to compare a player to the performance and description of a players in the past, and report on the similarities the unkonwn player has to the different positions historically. So while we'd like to see as close to perfection in terms of prediction on splits of our dataset, what will ultimately be most useful to us and NBA teams is if the reported probability of a player playing any given position is significantly higher than that of the other possible positions.

Data inspection and Cleaning

In [108]:
# first lests load the datasets in

data_path = '../data/basketball'
players = pd.read_csv(os.path.join(data_path, 'players.csv'))
players.head()
Out[108]:
Unnamed: 0 Player height weight collage born birth_city birth_state
0 0 Curly Armstrong 180.0 77.0 Indiana University 1918.0 NaN NaN
1 1 Cliff Barker 188.0 83.0 University of Kentucky 1921.0 Yorktown Indiana
2 2 Leo Barnhorst 193.0 86.0 University of Notre Dame 1924.0 NaN NaN
3 3 Ed Bartels 196.0 88.0 North Carolina State University 1925.0 NaN NaN
4 4 Ralph Beard 178.0 79.0 University of Kentucky 1927.0 Hardinsburg Kentucky

We probably don't need the "collage [sic]" or the their birth locationg. Probably don't really need their birth year either, but it might be interesting to look at generational splits. Also that unnamed column looks just like the index, so we can drop that too.

In [109]:
players.drop(['Unnamed: 0', 'collage', 'birth_city', 'birth_state', 'born'], axis=1, inplace=True)
players.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3922 entries, 0 to 3921
Data columns (total 3 columns):
Player    3921 non-null object
height    3921 non-null float64
weight    3921 non-null float64
dtypes: float64(2), object(1)
memory usage: 92.0+ KB

Good. They're all non null, and seem to be the correct datatype.

Now let's load the players stats.

In [110]:
stats = pd.read_csv(os.path.join(data_path, 'seasons_stats.csv'))
stats.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24691 entries, 0 to 24690
Data columns (total 53 columns):
Unnamed: 0    24691 non-null int64
Year          24624 non-null float64
Player        24624 non-null object
Pos           24624 non-null object
Age           24616 non-null float64
Tm            24624 non-null object
G             24624 non-null float64
GS            18233 non-null float64
MP            24138 non-null float64
PER           24101 non-null float64
TS%           24538 non-null float64
3PAr          18839 non-null float64
FTr           24525 non-null float64
ORB%          20792 non-null float64
DRB%          20792 non-null float64
TRB%          21571 non-null float64
AST%          22555 non-null float64
STL%          20792 non-null float64
BLK%          20792 non-null float64
TOV%          19582 non-null float64
USG%          19640 non-null float64
blanl         0 non-null float64
OWS           24585 non-null float64
DWS           24585 non-null float64
WS            24585 non-null float64
WS/48         24101 non-null float64
blank2        0 non-null float64
OBPM          20797 non-null float64
DBPM          20797 non-null float64
BPM           20797 non-null float64
VORP          20797 non-null float64
FG            24624 non-null float64
FGA           24624 non-null float64
FG%           24525 non-null float64
3P            18927 non-null float64
3PA           18927 non-null float64
3P%           15416 non-null float64
2P            24624 non-null float64
2PA           24624 non-null float64
2P%           24496 non-null float64
eFG%          24525 non-null float64
FT            24624 non-null float64
FTA           24624 non-null float64
FT%           23766 non-null float64
ORB           20797 non-null float64
DRB           20797 non-null float64
TRB           24312 non-null float64
AST           24624 non-null float64
STL           20797 non-null float64
BLK           20797 non-null float64
TOV           19645 non-null float64
PF            24624 non-null float64
PTS           24624 non-null float64
dtypes: float64(49), int64(1), object(3)
memory usage: 10.0+ MB

There are a lot of fields here, and they're pretty inconsistently filled. Some of this arises from the fact that its such a long timeline. For example, in 1950, there was no such thing as a 3-pointer, so it wouldn't make sense for those players to have 3pt% stats.

Inspecting the dataset a little further, we notice that there is no stat for points per game (PPG). The total number of points scored is listed, but that is hard to compare across seasons where they played different games. To make the dataset more valid, i.e. to make the points column a valid comparisson measure, we'll only consider seasons in which they played the current full 82 game schedule. Which doesn't reduce the power of the dataset by that much, they moved to a 82 game season in 1967, and only the lockout shortened 1998-99 season didn't have a full scehdule.

Actually we might want to limit it to just seasons after 1980 when they introduced the 3 pointer. That should just make the prediction task easier, although we lose even more of the dataset. But if we consider the business case as being how to decide players posisitions TODAY it makes sense.

In [111]:
stats = stats[stats.Year >= 1980]
stats = stats[stats.Year != 1998]
stats.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18380 entries, 5727 to 24690
Data columns (total 53 columns):
Unnamed: 0    18380 non-null int64
Year          18380 non-null float64
Player        18380 non-null object
Pos           18380 non-null object
Age           18380 non-null float64
Tm            18380 non-null object
G             18380 non-null float64
GS            17686 non-null float64
MP            18380 non-null float64
PER           18375 non-null float64
TS%           18307 non-null float64
3PAr          18295 non-null float64
FTr           18295 non-null float64
ORB%          18375 non-null float64
DRB%          18375 non-null float64
TRB%          18375 non-null float64
AST%          18375 non-null float64
STL%          18375 non-null float64
BLK%          18375 non-null float64
TOV%          18321 non-null float64
USG%          18375 non-null float64
blanl         0 non-null float64
OWS           18380 non-null float64
DWS           18380 non-null float64
WS            18380 non-null float64
WS/48         18375 non-null float64
blank2        0 non-null float64
OBPM          18380 non-null float64
DBPM          18380 non-null float64
BPM           18380 non-null float64
VORP          18380 non-null float64
FG            18380 non-null float64
FGA           18380 non-null float64
FG%           18295 non-null float64
3P            18380 non-null float64
3PA           18380 non-null float64
3P%           14969 non-null float64
2P            18380 non-null float64
2PA           18380 non-null float64
2P%           18266 non-null float64
eFG%          18295 non-null float64
FT            18380 non-null float64
FTA           18380 non-null float64
FT%           17657 non-null float64
ORB           18380 non-null float64
DRB           18380 non-null float64
TRB           18380 non-null float64
AST           18380 non-null float64
STL           18380 non-null float64
BLK           18380 non-null float64
TOV           18380 non-null float64
PF            18380 non-null float64
PTS           18380 non-null float64
dtypes: float64(49), int64(1), object(3)
memory usage: 7.6+ MB

Now, to start, lets just focus on a few categories

  • Player
  • games played (G)
  • minutes played (MP)
  • field goals, feild goal attempts, and percentage (FG, FGA)
  • free throws (FT, FTA), two-pointers (2P, 2PA), and three-pointers (3P, 3PA)
  • offensive, defensive, and total rebounds (ORB, DRB, TRB)
  • assists (AST)
  • steals (STL)
  • blocks (BLK)
  • turnovers (TOV)
  • personal fouls (PF)
  • points (PTS)

And of course our label: position. We could probably use any of the features as a label actually, and see if one could predict performance in one aspect of the game based on info in the another. But for now we'll stick with predicting position.

In [112]:
stats_to_keep = {'Player', 'Pos', 'G', 'MP', 'FG', 'FGA', 'FT', 'FTA',
                '2P', '2PA', '3P', '3PA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
                'TOV', 'PF', 'PTS'}

stats_to_drop = set(stats.columns)-stats_to_keep
stats.drop(stats_to_drop, axis=1, inplace=True)
stats.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18380 entries, 5727 to 24690
Data columns (total 21 columns):
Player    18380 non-null object
Pos       18380 non-null object
G         18380 non-null float64
MP        18380 non-null float64
FG        18380 non-null float64
FGA       18380 non-null float64
3P        18380 non-null float64
3PA       18380 non-null float64
2P        18380 non-null float64
2PA       18380 non-null float64
FT        18380 non-null float64
FTA       18380 non-null float64
ORB       18380 non-null float64
DRB       18380 non-null float64
TRB       18380 non-null float64
AST       18380 non-null float64
STL       18380 non-null float64
BLK       18380 non-null float64
TOV       18380 non-null float64
PF        18380 non-null float64
PTS       18380 non-null float64
dtypes: float64(19), object(2)
memory usage: 3.1+ MB

Okay. Finally, let's add the player description data to the stats dataframe.

In [113]:
stats['height'] = np.nan
stats['weight'] = np.nan

iplayer = players.set_index(keys='Player')
istats = stats.reset_index(drop=True)
for i, row in istats.iterrows():
    name = row[0]
    h = iplayer.loc[name].loc['height']
    w = iplayer.loc[name].loc['weight']
    
    # height and weight show up in the last two columns
    istats.iloc[i, len(istats.columns) - 2] = h
    istats.iloc[i, len(istats.columns) - 1] = w

stats = istats
In [114]:
# and now we don't need the names anymore
stats.drop(['Player'], axis=1, inplace=True)

Finally let's convert the position from a string to a number 1-5, and divide up the dataset into data and label (X and y). There are technically more than 5 listed (some are multiple positions listed,) but we'll just go off of the first-listed, primary position.

In [115]:
# first we need to separate out the label from the data
y = stats.Pos
set(y)
Out[115]:
{'C',
 'C-PF',
 'C-SF',
 'PF',
 'PF-C',
 'PF-SF',
 'PG',
 'PG-SF',
 'PG-SG',
 'SF',
 'SF-PF',
 'SF-SG',
 'SG',
 'SG-PF',
 'SG-PG',
 'SG-SF'}

lets reclassify this numerically and only count their 'primary ' position, so that each player will be given a position 1-5 {(1, pg), (2, sg), (3, sf), (4, pf), (5, c)}

In [116]:
import numpy as np
def convert_pos(y):
    newy = np.zeros((len(y), 1))
    for i, player in enumerate(y):
        if (player[0] == 'C'):
            newy[i] = 5
        elif (player[0:2] == 'PF'):
            newy[i] = 4
        elif (player[0:2] == 'SF'):
            newy[i] = 3
        elif (player[0:2] == 'SG'):
            newy[i] = 2
        elif (player[0:2] == 'PG'):
            newy[i] = 1
    return newy
In [117]:
y = convert_pos(y)
y
Out[117]:
array([[ 5.],
       [ 4.],
       [ 5.],
       ..., 
       [ 5.],
       [ 3.],
       [ 5.]])
In [118]:
# let's update it in the dataframe just for fun

stats['Pos'] = y
In [119]:
from sklearn import preprocessing

# now we can drop the label, and we'll scale it to zero mean
X = stats.drop(['Pos'], axis=1)

X = preprocessing.scale(X, axis=1)
X
Out[119]:
array([[-0.76045343,  3.26281462,  0.22926263, ...,  1.805185  ,
        -0.58169993, -0.73416615],
       [-0.42897319,  4.10073721, -0.09169605, ...,  0.72796583,
         0.09655165, -0.30347472],
       [-0.71099342,  3.54132577,  0.08136419, ...,  1.40805527,
        -0.44484253, -0.67035969],
       ..., 
       [-0.24896508,  1.52029628, -0.42787915, ..., -0.16944772,
         3.60762709,  1.52029628],
       [-0.44509845,  4.14513315, -0.19231973, ...,  0.68091581,
         0.4683519 , -0.140615  ],
       [-0.61372171,  3.54633314,  0.02740758, ...,  1.17852609,
         0.683108  , -0.01630578]])

Before we go any further let's see what some of the distributions are

In [120]:
# first, how many of each class
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook

graph1 = {'labels': ['PG', 'SG', 'SF', 'PF', 'C'],
          'values': np.bincount(y[:,0].astype(np.int32)-1), # -1 because it's looking for 0
            'type': 'pie'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Total Class Distribution',
                'autosize':False,
                'width':500,
                'height':400}

plotly.offline.iplot(fig)

Pretty even across the board. Not a requirement but nice to know. Also it makes sense. There are 5 unique positions in basketball, none of them are doubled (at least in theory), therefore there should be approximately the same number of them.

In [121]:
# now let's look at the distribution for a few of the features

sns.kdeplot(data=stats[stats.Pos == 1].height)
sns.kdeplot(data=stats[stats.Pos == 2].height)
sns.kdeplot(data=stats[stats.Pos == 3].height)
sns.kdeplot(data=stats[stats.Pos == 4].height)
sns.kdeplot(data=stats[stats.Pos == 5].height)

plt.legend(['PG', 'SG', 'SF', 'PF', 'C'])
plt.show()

This matches common reasoning that PG (the 1) are the the players and C (the 5) are the tallest. I bet the height and weight are correllatted too.

In [122]:
sns.jointplot(x="height", y="weight", data=stats);

Good intuition. What this means is that if the weight on height is high in our model it likely will be high for 'weight' (the feature) as well, and vice versa.

Dimensionality Reduction

PCA, find out the weights of each feature, and reduce the number of features required for calculation. Improves the execution time.

First, let's find out the number of components that has the wanted explained variance ratio

In [123]:
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
    plotly.offline.init_notebook_mode() # run at the start of every notebook
    
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')
            ],
        "layout": Layout(xaxis=XAxis(title='Principal components'), yaxis=YAxis(title='Explained variance ratio'))
    })
In [124]:
from sklearn.decomposition import PCA
n_components = 19
print ("Extracting the top %d eigenfaces from %d faces" % (
    n_components, stats.shape[1]))
pca = PCA(n_components=n_components)
%time 
stat_pca = pca.fit(X.copy()).transform(X.copy())

pca.components_.shape
Extracting the top 19 eigenfaces from 22 faces
Wall time: 0 ns
Out[124]:
(19, 21)
In [125]:
plot_explained_variance(pca)

According to the diagram, we can reach a sufficient level of explained variance with 5 components. We can probably deal with the accuracy that 98% of the actual information given the reduction from ~25 features to 5 which will greatly increase the speed of our analysis.

In [126]:
n_components = 5
print ("Extracting the top %d eigenfaces from %d faces" % (
    n_components, stats.shape[1]))
pca = PCA(n_components=n_components)
%time 
stats_pca = pca.fit(X.copy()).transform(X.copy())

pca.components_.shape
print(stats_pca)
Extracting the top 5 eigenfaces from 22 faces
Wall time: 0 ns
[[-1.84229988  1.12249995  0.73148395  0.04607751  0.04665367]
 [-1.19627888 -0.38903756  0.11605488 -0.00430811 -0.25647551]
 [-1.7538767   0.60997801  0.61099984  0.04118508 -0.15904718]
 ..., 
 [ 3.68474302 -0.28049574  0.32280632 -0.18881503  0.05552414]
 [-0.78579639 -0.72589037 -0.37445887  0.19133414  0.35910432]
 [-0.62166432  0.17395247  0.53991542 -0.63727906  0.16271986]]
In [127]:
stats_pca = pd.DataFrame(stats_pca, columns = ['first', 'second','third','forth','fifth'])
stats_nopos = stats
stats_nopos.drop(['Pos'], axis=1, inplace=True)

pca_absolute = np.absolute(pca.components_)

Visualize the weights of each features:

In [128]:
import plotly.graph_objs as go
import plotly.plotly as py
import plotly
from random import randint
plotly.offline.init_notebook_mode() # run at the start of every notebook
colors = []

for i in range(26):
    colors.append('%06X' % randint(0, 0xFFFFFF))
In [129]:
trace1 = go.Bar(
    x=stats.columns,
    y=pca_absolute[0],
    name='First',
    marker=dict(
        color=colors)
)
trace2 = go.Bar(
    x=stats.columns,
    y=pca_absolute[1],
    name='Second',
    marker=dict(
        color=colors)
)
trace3 = go.Bar(
    x=stats.columns,
    y=pca_absolute[2],
    name='Third',
    marker=dict(
        color=colors)
)
trace4 = go.Bar(
    x=stats.columns,
    y=pca_absolute[3],
    name='Forth',
    marker=dict(
        color=colors)
)
trace5 = go.Bar(
    x=stats.columns,
    y=pca_absolute[4],
    name='Fifth',
    marker=dict(
        color=colors)
)
data = [trace1, trace2, trace3, trace4, trace5]

layout = go.Layout(
    barmode = 'group'
)

fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig, filename = 'grouped-bar')

As we can see from the graph, there are certain features that are weighted heavily in most of the top 5 principal components (and significantly in at least one), namely minutes played, 3-point attmempts, 2-point attempts, total rebounds, assists, and height.

This is important to note because it might help inform our decision given the output of the classifier if we look at these features for a given input.

From here on, we'll be using these 5 components instead of the full feature set in order to speed up calculations. We can be reasonably sure we are not losing much information because these 5 components explain 98% of the total variance.

In [130]:
X = stats_pca.values
X = preprocessing.scale(X, axis=1)
X.shape
Out[130]:
(18380, 5)

Logistic Regression Model

Now we'll just start out by defining the Binary Logistic Regression classes as we did in class.

In [131]:
class BinaryLogisticRegressionBase:
    # private:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        return 'Base Binary Logistic Regression Object, Not Trainable'
    
    # convenience, private:
    @staticmethod
    def _sigmoid(theta):
        return 1/(1+np.exp(-theta)) 
    
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term

    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction
    
    
        
blr = BinaryLogisticRegressionBase(0.1)
print(blr)
Base Binary Logistic Regression Object, Not Trainable
In [132]:
# inherit from base class
class BinaryLogisticRegression(BinaryLogisticRegressionBase):
    #private:
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    def _get_gradient(self,X,y):
        # programming \sum_i (yi-g(xi))xi
        gradient = np.zeros(self.w_.shape) # set gradient to zero
        for (xi,yi) in zip(X,y):
            # the actual update inside of sum
            gradi = (yi - self.predict_proba(xi,add_bias=False))*xi 
            # reshape to be column vector and add to gradient
            gradient += gradi.reshape(self.w_.shape) 
        
        return gradient/float(len(y))
       
    # public:
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iters):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 

            
blr = BinaryLogisticRegression(0.1)
print(blr)
Untrained Binary Logistic Regression Object
In [133]:
# now lets do some vectorized coding
import numpy as np
from scipy.special import expit

class VectorBinaryLogisticRegression(BinaryLogisticRegression):
    # inherit from our previous class to get same functionality
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
    # but overwrite the gradient calculation
    def _get_gradient(self,X,y):
        yhat = self.predict_proba(X, add_bias=False).ravel()[:, np.newaxis]
        ydiff = y-yhat # get y difference

        
        gradient = np.mean(X * ydiff, axis=0) # make ydiff a column vector and multiply through
        
        return gradient.reshape(self.w_.shape)

Binary Case

Compare one vs. all. Starting with the 'Center' position which is #5.

An 80/20 split is a reasonable split for our dataset, since we have a large number of instances, and each of the classes is essentially equally represented (this doesn't really matter since the split is stratified anyway, but it's a nice perk.) We're also not dealing with any time series, or at least we're not defining them to be as such.

In [134]:
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
yb = (y>4).astype(np.int)
params = dict(eta=0.2,
              iterations=1000)
blr = VectorBinaryLogisticRegression(**params)

num_cv_iterations = 10
num_instances = len(yb)
cv_object = ShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2)
                         
In [135]:
score = np.zeros((10,)).astype(np.float)
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,yb)):
    blr.fit(X[train_indices],yb[train_indices])  # train object
    y_hat = blr.predict(X[test_indices]) # get test set precitions

    # print the accuracy and confusion matrix 
    print("====Iteration",iter_num," ====")
    s = accuracy_score(yb[test_indices],y_hat)
    print("accuracy", s) 
    score[iter_num] = s
print("mean accuracy:", np.mean(score), "(", np.std(score),")")
====Iteration 0  ====
accuracy 0.808759521219
====Iteration 1  ====
accuracy 0.812840043526
====Iteration 2  ====
accuracy 0.820729053319
====Iteration 3  ====
accuracy 0.804406964091
====Iteration 4  ====
accuracy 0.817736670294
====Iteration 5  ====
accuracy 0.816920565832
====Iteration 6  ====
accuracy 0.810935799782
====Iteration 7  ====
accuracy 0.816104461371
====Iteration 8  ====
accuracy 0.81528835691
====Iteration 9  ====
accuracy 0.806583242655
mean accuracy: 0.8130304679 ( 0.00499693237699 )

Let's do the same for all classes:

In [136]:
means = np.zeros((5,))
stds = np.zeros((5,))
for i in range(5):
    yb = (y==i+1).astype(np.int)
    params = dict(eta=0.2,
              iterations=1000)
    blr = VectorBinaryLogisticRegression(**params)

    num_cv_iterations = 10
    num_instances = len(yb)
    cv_object = ShuffleSplit(
                             n_splits=num_cv_iterations,
                             test_size  = 0.2)
    for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X,yb)):
        blr.fit(X[train_indices],yb[train_indices])  # train object
        y_hat = blr.predict(X[test_indices]) # get test set precitions

        s = accuracy_score(yb[test_indices],y_hat)
        score[iter_num] = s
        
    means[i] = np.mean(score)
    stds[i] = np.std(score)
    
    print("mean accuracy: ", means[i], "+-", stds[i])
mean accuracy:  0.880250272035 +- 0.00320447319317
mean accuracy:  0.793688792165 +- 0.00685851522069
mean accuracy:  0.805440696409 +- 0.00572541211363
mean accuracy:  0.778808487486 +- 0.00388438671589
mean accuracy:  0.809194776931 +- 0.00842447523432
In [137]:
data = [go.Bar(
            x=['PG', 'SG', 'SF','PF','C'],
            y=means,
            error_y=dict(
                type='data',
                array=stds,
            ),
    )]
layout = go.Layout(
    title = "Accuracies for each classes"
)
fig = go.Figure(data = data, layout = layout)
plotly.offline.iplot(fig, filename='basic-bar')

Really small error bars, you love to see that. Meaning, that our train/test splits produce consistent results, and should be pretty representative of real data.

Multiclass

In [138]:
class LogisticRegression:
    def __init__(self, eta, iterations=20):
        self.eta = eta
        self.iters = iterations
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = y==yval # create a binary problem
            # train the binary classifier for this class
            blr = VectorBinaryLogisticRegression(self.eta,self.iters)
            blr.fit(X,y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row
    
lr = LogisticRegression(0.1,1500)
print(lr)
Untrained MultiClass Logistic Regression Object
In [139]:
%%time
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lr = LogisticRegression(0.1,1000)
lr.fit(X_train,y_train)
print(lr)

yhat = lr.predict(X_test)
print('Accuracy of: ',accuracy_score(y_test,yhat))
MultiClass Logistic Regression Object with coefficients:
[[-2.07723325  0.31691119  1.22029855 -2.10612338  1.93820282 -1.36928918]
 [-1.61114675  0.0381536   0.70718442 -1.04856797 -0.05764312  0.36087307]
 [-1.57583832 -0.10361187  0.05266586 -0.18663289 -0.38284305  0.62042196]
 [-1.58887798 -0.04994551 -0.36250931  0.9003603  -0.53724417  0.04933868]
 [-1.9109355   0.16354686 -0.64721556  1.78250495 -0.25555311 -1.04328314]]
Accuracy of:  0.176822633297
Wall time: 3.05 s

Extending the Logistic regression

Now we'll use various optimization techniques and varying parameters to see if we can get better results.

First we'll set up all the classes that we need.

In [140]:
# from last time, our logistic regression algorithm is given by (including everything we previously had):
class BinaryLogisticRegression:
    def __init__(self, eta, iterations=20, C=0.001, regularization='L2'):
        self.eta = eta
        self.iterations = iterations
        self.C = C
        self.regularization = regularization
        # internally we will store the weights as self.w_ to keep with sklearn conventions
        
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'Binary Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained Binary Logistic Regression Object'
        
    # convenience, private:
    @staticmethod
    def _add_bias(X):
        return np.hstack((np.ones((X.shape[0],1)),X)) # add bias term
    
    @staticmethod
    def _sigmoid(theta):
        # increase stability, redefine sigmoid operation
        return expit(theta) #1/(1+np.exp(-theta))
    
    # vectorized gradient calculation with regularization
    def _get_gradient(self,X,y):
        
        ydiff = y-self.predict_proba(X, add_bias=False) # get y difference
        gradient = np.mean(X * ydiff, axis=0) # make ydiff a column vector and multiply through
        
        gradient = gradient.reshape(self.w_.shape)
        
        # gradient will be different depending on which regularization technique we use
        L1 = np.sign(self.w_[1:]) * self.C
        L2 = -2 * self.w_[1:] * self.C
        if (self.regularization=='L1'):
            gradient[1:] += L1
        elif (self.regularization == 'L2'):
            gradient[1:] += L2
        elif (self.regularization == 'L1/L2'):
            gradient[1:] += (L1 + L2)
        return gradient
    
    # public:
    def predict_proba(self,X,add_bias=True):
        # add bias term if requested
        Xb = self._add_bias(X) if add_bias else X
        return self._sigmoid(Xb @ self.w_) # return the probability y=1
    
    def predict(self,X):
        return (self.predict_proba(X)>0.5) #return the actual prediction
    
    
    def fit(self, X, y):
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = np.zeros((num_features,1)) # init weight vector to zeros
        
        # for as many as the max iterations
        for _ in range(self.iterations):
            gradient = self._get_gradient(Xb,y)
            self.w_ += gradient*self.eta # multiply by learning rate 

Test it out

In [141]:
blr = BinaryLogisticRegression(eta=0.1,iterations=500,C=0.1)

blr.fit(X,yb)
print(blr)

yhat = blr.predict(X)
print('Accuracy of: ',accuracy_score(yb,yhat))
Binary Logistic Regression Object with coefficients:
[[-1.42879456]
 [ 0.04069606]
 [-0.27355813]
 [ 0.39314347]
 [-0.08643905]
 [-0.07384235]]
Accuracy of:  0.79907508161

Stochastic Class

In [142]:
class StochasticLogisticRegression(BinaryLogisticRegression):
    # stochastic gradient calculation 
    def _get_gradient(self,X,y):
        idx = int(np.random.rand()*len(y)) # grab random instance
        ydiff = y[idx]-self.predict_proba(X[idx],add_bias=False) # get y difference (now scalar)
        gradient = X[idx] * ydiff[:,np.newaxis] # make ydiff a column vector and multiply through
        
        gradient = gradient.reshape(self.w_.shape)
        
        gradient[1:] 
        L1 = np.sign(w) * self.C
        L2 = -2 * self.w_[1:] * self.C
        if (self.regularization=='L1'):
            gradient[1:] += L1
        elif (self.regularization == 'L2'):
            gradient[1:] += L2
        elif (self.regularization == 'L1/L2'):
            gradient[1:] += (L1 + L2)
        return gradient
        

Implementation of BFGS

In [143]:
from scipy.optimize import fmin_bfgs
class BFGSBinaryLogisticRegression(BinaryLogisticRegression):
    
    @staticmethod
    def objective_function(w,X,y,C):
        g = expit(X @ w)
        return -np.sum(np.log(g[y==1]))-np.sum(np.log(1-g[y==0])) + C*sum(w**2) #-np.sum(y*np.log(g)+(1-y)*np.log(1-g))

    @staticmethod
    def objective_gradient(w,X,y,C):
        g = expit(X @ w)
        ydiff = y-g # get y difference
        gradient = np.mean(X * ydiff[:,np.newaxis], axis=0)
        gradient = gradient.reshape(w.shape)
        
        if (self.regularization=='L1'):
            gradient[1:] += L1
        elif (self.regularization == 'L2'):
            gradient[1:] += L2
        elif (self.regularization == 'L1/L2'):
            gradient[1:] += (L1 + L2)
        
        return gradient
    
    # just overwrite fit function
    def fit(self, X, y):
        # overwrite self.iters, because we know we don't need it
        self.iters = 3
        Xb = self._add_bias(X) # add bias term
        num_samples, num_features = Xb.shape
        
        self.w_ = fmin_bfgs(self.objective_function, # what to optimize
                            np.zeros((num_features,1)), # starting point
                            fprime=self.objective_gradient, # gradient function
                            args=(Xb,y,self.C), # extra args for gradient and objective function
                            gtol=1e-03, # stopping criteria for gradient, |v_k|
                            maxiter=self.iters, # stopping criteria iterations
                            disp=False)
        
        self.w_ = self.w_.reshape((num_features,1))

Note that all of the above are BINARY classifiers. I.e. they make a one vs. all comparrison. To get multiclass we need to do that for all of the classes. The class below does just that. Its the same as the multiclass implementation above, modifed to have a choice of optimization and regularization method, as well as some requisite helper functions to be used with SKLearn's GridSearchCV.

Using GridSearchCV will allow us to optimize the free parameters we've created (iteration, C, optimization technique, and regularization method) in order to find the best solution (as defined by the score() function which we have said is just the accuracy)

In [144]:
import warnings
class LRClassifier:
    def __init__(self, eta=0.1, iterations=20, C=0.001, optimization='full', regularization='L2'):
        self.eta = eta
        self.iterations = iterations
        self.C = C
        self.opt = optimization
        self.regularization = regularization
        # internally we will store the weights as self.w_ to keep with sklearn conventions
    
    def __str__(self):
        if(hasattr(self,'w_')):
            return 'MultiClass Logistic Regression Object with coefficients:\n'+ str(self.w_) # is we have trained the object
        else:
            return 'Untrained MultiClass Logistic Regression Object'
        
    def fit(self,X,y):
        num_samples, num_features = X.shape
        self.unique_ = np.unique(y) # get each unique class value
        num_unique_classes = len(self.unique_)
        self.classifiers_ = [] # will fill this array with binary classifiers
        
        for i,yval in enumerate(self.unique_): # for each unique value
            y_binary = y==yval # create a binary problem
            # train the binary classifier for this class
            if (self.opt == 'stochastic'):
                blr = StochasticLogisticRegression(self.eta, self.iterations, self.C, self.regularization)
            elif (self.opt == 'bfgs'):
                blr = BFGSBinaryLogisticRegression(_, 2, self.C, self.regularization)
            else:
                blr = BinaryLogisticRegression(self.eta, self.iterations, self.C, self.regularization)
            
            blr.fit(X,y_binary)
            # add the trained classifier to the list
            self.classifiers_.append(blr)
            
        # save all the weights into one matrix, separate column for each class
        self.w_ = np.hstack([x.w_ for x in self.classifiers_]).T
        
    def predict_proba(self,X):
        probs = []
        for blr in self.classifiers_:
            probs.append(blr.predict_proba(X)) # get probability for each classifier
        
        return np.hstack(probs) # make into single matrix
    
    def predict(self,X):
        return np.argmax(self.predict_proba(X),axis=1) # take argmax along row

    def score(self,*args, **kwargs):
        yhat = self.predict(X)
        return accuracy_score(y, yhat+1)
    
    # lifted straight out of sklearn source code with some modifications
    @classmethod
    def _get_param_names(cls):
        # this is just specific to this classifier
        return sorted(['eta', 'iterations', 'C', 'optimization', 'regularization'])

    def get_params(self, deep=True):
        """Get parameters for this estimator.
        Parameters
        ----------
        deep : boolean, optional
            If True, will return the parameters for this estimator and
            contained subobjects that are estimators.
        Returns
        -------
        params : mapping of string to any
            Parameter names mapped to their values.
        """
        out = dict()
        for key in self._get_param_names():
            # We need deprecation warnings to always be on in order to
            # catch deprecated param values.
            # This is set in utils/__init__.py but it gets overwritten
            # when running under python3 somehow.
            warnings.simplefilter("always", DeprecationWarning)
            try:
                with warnings.catch_warnings(record=True) as w:
                    value = getattr(self, key, None)
                if len(w) and w[0].category == DeprecationWarning:
                    # if the parameter is deprecated, don't show it
                    continue
            finally:
                warnings.filters.pop(0)

            # XXX: should we rather test if instance of estimator?
            if deep and hasattr(value, 'get_params'):
                deep_items = value.get_params().items()
                out.update((key + '__' + k, val) for k, val in deep_items)
            out[key] = value
        return out
     
    def set_params(self, **params):
        """Set the parameters of this estimator.
        The method works on simple estimators as well as on nested objects
        (such as pipelines). The latter have parameters of the form
        ``<component>__<parameter>`` so that it's possible to update each
        component of a nested object.
        Returns
        -------
        self
        """
        if not params:
            # Simple optimization to gain speed (inspect is slow)
            return self
        valid_params = self.get_params(deep=True)
        # changed from six.iteritems() bc no need for py2 vs py3 compatabillity
        for key, value in params.items():
            split = key.split('__', 1)
            if len(split) > 1:
                # nested objects case
                name, sub_name = split
                if name not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
                                     (name, self))
                sub_object = valid_params[name]
                sub_object.set_params(**{sub_name: value})
            else:
                # simple objects case
                if key not in valid_params:
                    raise ValueError('Invalid parameter %s for estimator %s. '
                                     'Check the list of available parameters '
                                     'with `estimator.get_params().keys()`.' %
                                     (key, self.__class__.__name__))
                setattr(self, key, value)
        return self

        

Let's try it out! This will take some time

In [145]:
%%time

from sklearn.model_selection import GridSearchCV
parameters = {'iterations': [20, 50, 100],
              'C':np.logspace(base=10, start=-3, stop=1, num=5), 
              'optimization': ['full', 'stochastic', 'bfgs'],
              'regularization': ['L1', 'L2', 'L1/L2']}
lrc = LRClassifier(eta=0.1)
clf = GridSearchCV(lrc, parameters)
clf.fit(X, y)
Wall time: 1min 4s
In [146]:
cvresults = pd.DataFrame(clf.cv_results_)
cvresults
Out[146]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_C param_iterations param_optimization param_regularization params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
0 0.053644 0.002172 0.436961 0.436960 0.001 20 full L1 {'C': 0.001, 'iterations': 20, 'optimization':... 34 0.439989 0.439989 0.435201 0.435201 0.435691 0.435691 0.003090 6.255609e-04 0.002151 0.002151
1 0.053963 0.003008 0.437106 0.437106 0.001 20 full L2 {'C': 0.001, 'iterations': 20, 'optimization':... 22 0.440424 0.440424 0.435473 0.435473 0.435419 0.435419 0.001336 1.123916e-07 0.002347 0.002347
2 0.050969 0.002507 0.436997 0.436997 0.001 20 full L1/L2 {'C': 0.001, 'iterations': 20, 'optimization':... 31 0.439989 0.439989 0.435310 0.435310 0.435691 0.435691 0.000625 1.946680e-07 0.002122 0.002122
3 0.051637 0.003175 0.436961 0.436960 0.001 20 stochastic L1 {'C': 0.001, 'iterations': 20, 'optimization':... 34 0.439989 0.439989 0.435201 0.435201 0.435691 0.435691 0.001083 2.362475e-04 0.002151 0.002151
4 0.053039 0.003008 0.437106 0.437106 0.001 20 stochastic L2 {'C': 0.001, 'iterations': 20, 'optimization':... 22 0.440424 0.440424 0.435473 0.435473 0.435419 0.435419 0.002222 4.095814e-04 0.002347 0.002347
5 0.052129 0.002674 0.436997 0.436997 0.001 20 stochastic L1/L2 {'C': 0.001, 'iterations': 20, 'optimization':... 31 0.439989 0.439989 0.435310 0.435310 0.435691 0.435691 0.001097 2.361910e-04 0.002122 0.002122
6 0.052296 0.002683 0.436961 0.436960 0.001 20 bfgs L1 {'C': 0.001, 'iterations': 20, 'optimization':... 34 0.439989 0.439989 0.435201 0.435201 0.435691 0.435691 0.001242 2.484416e-04 0.002151 0.002151
7 0.052974 0.002674 0.437106 0.437106 0.001 20 bfgs L2 {'C': 0.001, 'iterations': 20, 'optimization':... 22 0.440424 0.440424 0.435473 0.435473 0.435419 0.435419 0.001030 2.367529e-04 0.002347 0.002347
8 0.051136 0.002841 0.436997 0.436997 0.001 20 bfgs L1/L2 {'C': 0.001, 'iterations': 20, 'optimization':... 31 0.439989 0.439989 0.435310 0.435310 0.435691 0.435691 0.002047 2.365846e-04 0.002122 0.002122
9 0.133050 0.002340 0.434857 0.434857 0.001 50 full L1 {'C': 0.001, 'iterations': 50, 'optimization':... 43 0.429053 0.429053 0.440098 0.440098 0.435419 0.435419 0.002344 2.368655e-04 0.004527 0.004526
10 0.131517 0.002682 0.434639 0.434639 0.001 50 full L2 {'C': 0.001, 'iterations': 50, 'optimization':... 52 0.428618 0.428618 0.439826 0.439826 0.435473 0.435473 0.002019 2.304091e-04 0.004614 0.004613
11 0.128349 0.002506 0.434784 0.434784 0.001 50 full L1/L2 {'C': 0.001, 'iterations': 50, 'optimization':... 46 0.428781 0.428781 0.439935 0.439935 0.435637 0.435637 0.004712 2.973602e-07 0.004593 0.004593
12 0.131354 0.002674 0.434857 0.434857 0.001 50 stochastic L1 {'C': 0.001, 'iterations': 50, 'optimization':... 43 0.429053 0.429053 0.440098 0.440098 0.435419 0.435419 0.002278 2.362471e-04 0.004527 0.004526
13 0.130514 0.003008 0.434639 0.434639 0.001 50 stochastic L2 {'C': 0.001, 'iterations': 50, 'optimization':... 52 0.428618 0.428618 0.439826 0.439826 0.435473 0.435473 0.005118 4.093868e-04 0.004614 0.004613
14 0.143382 0.002841 0.434784 0.434784 0.001 50 stochastic L1/L2 {'C': 0.001, 'iterations': 50, 'optimization':... 46 0.428781 0.428781 0.439935 0.439935 0.435637 0.435637 0.010333 4.723819e-04 0.004593 0.004593
15 0.141217 0.002841 0.434857 0.434857 0.001 50 bfgs L1 {'C': 0.001, 'iterations': 50, 'optimization':... 43 0.429053 0.429053 0.440098 0.440098 0.435419 0.435419 0.001429 2.360790e-04 0.004527 0.004526
16 0.138510 0.003008 0.434639 0.434639 0.001 50 bfgs L2 {'C': 0.001, 'iterations': 50, 'optimization':... 52 0.428618 0.428618 0.439826 0.439826 0.435473 0.435473 0.002030 1.946680e-07 0.004614 0.004613
17 0.137557 0.002674 0.434784 0.434784 0.001 50 bfgs L1/L2 {'C': 0.001, 'iterations': 50, 'optimization':... 46 0.428781 0.428781 0.439935 0.439935 0.435637 0.435637 0.001308 4.726629e-04 0.004593 0.004593
18 0.274738 0.003176 0.441204 0.441204 0.001 100 full L1 {'C': 0.001, 'iterations': 100, 'optimization'... 10 0.434168 0.434168 0.448912 0.448912 0.440533 0.440533 0.000819 2.375959e-04 0.006038 0.006038
19 0.268821 0.002841 0.440207 0.440207 0.001 100 full L2 {'C': 0.001, 'iterations': 100, 'optimization'... 16 0.433079 0.433079 0.448585 0.448585 0.438955 0.438955 0.010013 2.362475e-04 0.006392 0.006392
20 0.264016 0.003008 0.440896 0.440896 0.001 100 full L1/L2 {'C': 0.001, 'iterations': 100, 'optimization'... 13 0.433841 0.433841 0.448422 0.448422 0.440424 0.440424 0.004021 1.946680e-07 0.005962 0.005962
21 0.263541 0.002516 0.441204 0.441204 0.001 100 stochastic L1 {'C': 0.001, 'iterations': 100, 'optimization'... 10 0.434168 0.434168 0.448912 0.448912 0.440533 0.440533 0.006923 1.258786e-05 0.006038 0.006038
22 0.267385 0.002841 0.440207 0.440207 0.001 100 stochastic L2 {'C': 0.001, 'iterations': 100, 'optimization'... 16 0.433079 0.433079 0.448585 0.448585 0.438955 0.438955 0.003336 2.359100e-04 0.006392 0.006392
23 0.256296 0.003342 0.440896 0.440896 0.001 100 stochastic L1/L2 {'C': 0.001, 'iterations': 100, 'optimization'... 13 0.433841 0.433841 0.448422 0.448422 0.440424 0.440424 0.002313 4.725505e-04 0.005962 0.005962
24 0.280830 0.003008 0.441204 0.441204 0.001 100 bfgs L1 {'C': 0.001, 'iterations': 100, 'optimization'... 10 0.434168 0.434168 0.448912 0.448912 0.440533 0.440533 0.009017 6.836514e-07 0.006038 0.006038
25 0.273372 0.002850 0.440207 0.440207 0.001 100 bfgs L2 {'C': 0.001, 'iterations': 100, 'optimization'... 16 0.433079 0.433079 0.448585 0.448585 0.438955 0.438955 0.021189 2.230974e-04 0.006392 0.006392
26 0.256797 0.002744 0.440896 0.440896 0.001 100 bfgs L1/L2 {'C': 0.001, 'iterations': 100, 'optimization'... 13 0.433841 0.433841 0.448422 0.448422 0.440424 0.440424 0.006777 3.739269e-04 0.005962 0.005962
27 0.053387 0.002674 0.432191 0.432191 0.01 20 full L1 {'C': 0.01, 'iterations': 20, 'optimization': ... 58 0.432644 0.432644 0.431665 0.431665 0.432263 0.432263 0.002162 4.728318e-04 0.000403 0.000403
28 0.052962 0.002675 0.437088 0.437087 0.01 20 full L2 {'C': 0.01, 'iterations': 20, 'optimization': ... 28 0.440370 0.440370 0.435473 0.435473 0.435419 0.435419 0.000931 4.723258e-04 0.002321 0.002321
29 0.053221 0.003175 0.431919 0.431919 0.01 20 full L1/L2 {'C': 0.01, 'iterations': 20, 'optimization': ... 61 0.432372 0.432372 0.431502 0.431502 0.431882 0.431882 0.001675 6.254761e-04 0.000356 0.000356
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
105 0.266612 0.002850 0.393434 0.393435 1 100 bfgs L1 {'C': 1.0, 'iterations': 100, 'optimization': ... 88 0.399728 0.399728 0.378727 0.378727 0.401850 0.401850 0.006163 2.430105e-04 0.010436 0.010436
106 0.255389 0.002506 0.376496 0.376496 1 100 bfgs L2 {'C': 1.0, 'iterations': 100, 'optimization': ... 100 0.375517 0.375517 0.373395 0.373395 0.380577 0.380577 0.007819 4.092894e-04 0.003013 0.003013
107 0.259005 0.002841 0.361370 0.361371 1 100 bfgs L1/L2 {'C': 1.0, 'iterations': 100, 'optimization': ... 106 0.366485 0.366485 0.343906 0.343906 0.373721 0.373721 0.009089 2.363596e-04 0.012698 0.012698
108 0.050109 0.002674 0.353500 0.353500 10 20 full L1 {'C': 10.0, 'iterations': 20, 'optimization': ... 109 0.354897 0.354897 0.344886 0.344886 0.360718 0.360718 0.002177 2.362472e-04 0.006539 0.006539
109 0.051980 0.002999 0.181755 0.181756 10 20 full L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 127 0.145539 0.145539 0.210555 0.210555 0.189173 0.189173 0.000628 1.241965e-05 0.027057 0.027056
110 0.055824 0.003008 0.207324 0.207327 10 20 full L1/L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 124 0.132427 0.132427 0.227584 0.227584 0.261970 0.261970 0.002262 5.150430e-07 0.054791 0.054791
111 0.052291 0.002674 0.353500 0.353500 10 20 stochastic L1 {'C': 10.0, 'iterations': 20, 'optimization': ... 109 0.354897 0.354897 0.344886 0.344886 0.360718 0.360718 0.002433 4.731125e-04 0.006539 0.006539
112 0.050302 0.002182 0.181755 0.181756 10 20 stochastic L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 127 0.145539 0.145539 0.210555 0.210555 0.189173 0.189173 0.000851 6.374911e-04 0.027057 0.027056
113 0.054814 0.002674 0.207324 0.207327 10 20 stochastic L1/L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 124 0.132427 0.132427 0.227584 0.227584 0.261970 0.261970 0.002502 4.729438e-04 0.054791 0.054791
114 0.054953 0.003008 0.353500 0.353500 10 20 bfgs L1 {'C': 10.0, 'iterations': 20, 'optimization': ... 109 0.354897 0.354897 0.344886 0.344886 0.360718 0.360718 0.003741 1.123916e-07 0.006539 0.006539
115 0.053477 0.003008 0.181755 0.181756 10 20 bfgs L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 127 0.145539 0.145539 0.210555 0.210555 0.189173 0.189173 0.000945 2.247832e-07 0.027057 0.027056
116 0.059044 0.003343 0.207324 0.207327 10 20 bfgs L1/L2 {'C': 10.0, 'iterations': 20, 'optimization': ... 124 0.132427 0.132427 0.227584 0.227584 0.261970 0.261970 0.003456 4.727753e-04 0.054791 0.054791
117 0.134998 0.003008 0.347189 0.347189 10 50 full L1 {'C': 10.0, 'iterations': 50, 'optimization': ... 112 0.347171 0.347171 0.341023 0.341023 0.353373 0.353373 0.004608 2.973602e-07 0.005042 0.005042
118 0.141543 0.003009 0.176695 0.176696 10 50 full L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 133 0.178074 0.178074 0.169750 0.169750 0.182263 0.182263 0.001023 2.973602e-07 0.005201 0.005201
119 0.150622 0.003843 0.231626 0.231629 10 50 full L1/L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 118 0.183896 0.183896 0.229489 0.229489 0.281502 0.281502 0.002993 6.252211e-04 0.039876 0.039876
120 0.144722 0.002841 0.347189 0.347189 10 50 stochastic L1 {'C': 10.0, 'iterations': 50, 'optimization': ... 112 0.347171 0.347171 0.341023 0.341023 0.353373 0.353373 0.003714 2.356292e-04 0.005042 0.005042
121 0.130516 0.003008 0.176695 0.176696 10 50 stochastic L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 133 0.178074 0.178074 0.169750 0.169750 0.182263 0.182263 0.004710 2.973602e-07 0.005201 0.005201
122 0.126494 0.002506 0.231626 0.231629 10 50 stochastic L1/L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 118 0.183896 0.183896 0.229489 0.229489 0.281502 0.281502 0.001847 4.092894e-04 0.039876 0.039876
123 0.124163 0.003008 0.347189 0.347189 10 50 bfgs L1 {'C': 10.0, 'iterations': 50, 'optimization': ... 112 0.347171 0.347171 0.341023 0.341023 0.353373 0.353373 0.001705 4.093868e-04 0.005042 0.005042
124 0.128936 0.002841 0.176695 0.176696 10 50 bfgs L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 133 0.178074 0.178074 0.169750 0.169750 0.182263 0.182263 0.001633 2.364159e-04 0.005201 0.005201
125 0.128996 0.003176 0.231626 0.231629 10 50 bfgs L1/L2 {'C': 10.0, 'iterations': 50, 'optimization': ... 118 0.183896 0.183896 0.229489 0.229489 0.281502 0.281502 0.003954 2.368655e-04 0.039876 0.039876
126 0.251409 0.002676 0.342310 0.342310 10 100 full L1 {'C': 10.0, 'iterations': 100, 'optimization':... 115 0.342165 0.342165 0.335909 0.335909 0.348857 0.348857 0.002328 4.737354e-04 0.005287 0.005287
127 0.257082 0.002832 0.178473 0.178473 10 100 full L2 {'C': 10.0, 'iterations': 100, 'optimization':... 130 0.175571 0.175571 0.177639 0.177639 0.182209 0.182209 0.002788 2.300992e-04 0.002773 0.002773
128 0.262581 0.002340 0.223757 0.223758 10 100 full L1/L2 {'C': 10.0, 'iterations': 100, 'optimization':... 121 0.236942 0.236942 0.194505 0.194505 0.239826 0.239826 0.002760 4.727191e-04 0.020719 0.020718
129 0.266150 0.002674 0.342310 0.342310 10 100 stochastic L1 {'C': 10.0, 'iterations': 100, 'optimization':... 115 0.342165 0.342165 0.335909 0.335909 0.348857 0.348857 0.009864 4.727753e-04 0.005287 0.005287
130 0.256714 0.002507 0.178473 0.178473 10 100 stochastic L2 {'C': 10.0, 'iterations': 100, 'optimization':... 130 0.175571 0.175571 0.177639 0.177639 0.182209 0.182209 0.005653 4.094840e-04 0.002773 0.002773
131 0.257297 0.002841 0.223757 0.223758 10 100 stochastic L1/L2 {'C': 10.0, 'iterations': 100, 'optimization':... 121 0.236942 0.236942 0.194505 0.194505 0.239826 0.239826 0.005359 2.362471e-04 0.020719 0.020718
132 0.252674 0.002841 0.342310 0.342310 10 100 bfgs L1 {'C': 10.0, 'iterations': 100, 'optimization':... 115 0.342165 0.342165 0.335909 0.335909 0.348857 0.348857 0.003949 2.361347e-04 0.005287 0.005287
133 0.259383 0.002841 0.178473 0.178473 10 100 bfgs L2 {'C': 10.0, 'iterations': 100, 'optimization':... 130 0.175571 0.175571 0.177639 0.177639 0.182209 0.182209 0.003479 2.363595e-04 0.002773 0.002773
134 0.258859 0.002674 0.223757 0.223758 10 100 bfgs L1/L2 {'C': 10.0, 'iterations': 100, 'optimization':... 121 0.236942 0.236942 0.194505 0.194505 0.239826 0.239826 0.003868 2.361347e-04 0.020719 0.020718

135 rows × 20 columns

In [147]:
print("The most accurate model was:", clf.best_score_)
print("was found using the following parameters:")
print(clf.best_estimator_.get_params())
The most accurate model was: 0.447497756231
was found using the following parameters:
{'C': 0.10000000000000001, 'eta': 0.1, 'iterations': 100, 'optimization': 'full', 'regularization': 'L1'}

Let's take a closer look at how performance varied over various parameters. Holding the optimization and iterations parameters constant, how did the regularization affect the results?

In [148]:
batch = cvresults[cvresults.param_optimization == 'full']
batch100 = batch[batch.param_iterations == 100]

sns.pointplot(batch100.param_C.values, batch100.mean_test_score.values, hue=batch100.param_regularization)
plt.title('Accuracy vs. Reg Constant C over different Reg Techniques')
plt.show()

Looks like the method of regularization is relatively insignificant for small values of C but as it increases the L1 method prevails. In general however, the smaller the value of C the better the performance.

Which, if any, of the optimization techniques performed the fastest?

In [149]:
L1L2 = cvresults[cvresults.param_regularization == 'L1/L2']
L1L2_C1 = L1L2[L1L2.param_C == .001]

sns.pointplot(L1L2_C1.param_iterations.values, L1L2_C1.mean_fit_time.values, hue=L1L2_C1.param_optimization)
plt.title('Fit time vs. # of Iterations by Optimization Technique')
plt.show()

Looks like the optimization techniques aren't really optimizing much. They all have about the same fit_time.

Comparison to SKLEARN

SKLearn has been helping us a lot here. Let's see how we compare against their implementation

In [150]:
from sklearn.linear_model import LogisticRegressionCV as SKLogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


lr_sk = SKLogisticRegression() # all params default

%time lr_sk.fit(X_train,y_train.ravel())
print(np.hstack((lr_sk.intercept_[:,np.newaxis],lr_sk.coef_)))
yhat = lr_sk.predict(X_test)

best_score = accuracy_score(y_test.ravel(),yhat)

print("The most accurate model was:", best_score)
print("was found using the following parameters:")
print(lr_sk.get_params())
print('values of C (for each class, bc sklearn is dope like that)', lr_sk.C_)
Wall time: 634 ms
[[ -2.40257203e+00   3.86432375e-01   1.47030805e+00  -2.52993380e+00
    2.26698664e+00  -1.59379327e+00]
 [ -1.40249329e+00  -7.52004972e-03   8.17459669e-02  -9.90454450e-02
   -6.92681494e-04   2.55122093e-02]
 [ -1.44884147e+00  -2.68312712e-02   2.24797889e-02  -1.58024434e-02
   -1.95213583e-02   3.96752841e-02]
 [ -1.32295500e+00  -3.84635921e-03  -6.61654044e-02   1.09343882e-01
   -4.49216515e-02   5.58953277e-03]
 [ -2.06229195e+00   2.29592815e-01  -5.49609444e-01   2.20256699e+00
   -2.18204299e-01  -1.66434607e+00]]
The most accurate model was: 0.449673558215
was found using the following parameters:
{'Cs': 10, 'class_weight': None, 'cv': None, 'dual': False, 'fit_intercept': True, 'intercept_scaling': 1.0, 'max_iter': 100, 'multi_class': 'ovr', 'n_jobs': 1, 'penalty': 'l2', 'random_state': None, 'refit': True, 'scoring': None, 'solver': 'lbfgs', 'tol': 0.0001, 'verbose': 0}
values of C (for each class, bc sklearn is dope like that) [  4.64158883e-02   1.00000000e-04   1.00000000e-04   1.00000000e-04
   3.59381366e-01]

SKLearn is reporting a different set of parameters to optimize the accuracy, albeit for about the same accuracy.

In [151]:
mimic = cvresults[(cvresults.param_optimization == 'bfgs') & (cvresults.param_iterations == 100) & (cvresults.param_regularization == 'L2')]
mimic
Out[151]:
mean_fit_time mean_score_time mean_test_score mean_train_score param_C param_iterations param_optimization param_regularization params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
25 0.273372 0.002850 0.440207 0.440207 0.001 100 bfgs L2 {'C': 0.001, 'iterations': 100, 'optimization'... 16 0.433079 0.433079 0.448585 0.448585 0.438955 0.438955 0.021189 2.230974e-04 0.006392 0.006392
52 0.267314 0.003017 0.438792 0.438792 0.01 100 bfgs L2 {'C': 0.01, 'iterations': 100, 'optimization':... 19 0.431393 0.431393 0.446736 0.446736 0.438248 0.438248 0.006097 1.275979e-05 0.006276 0.006275
79 0.280786 0.003008 0.431792 0.431792 0.1 100 bfgs L2 {'C': 0.1, 'iterations': 100, 'optimization': ... 67 0.418825 0.418825 0.442764 0.442764 0.433787 0.433787 0.003315 5.150430e-07 0.009875 0.009874
106 0.255389 0.002506 0.376496 0.376496 1 100 bfgs L2 {'C': 1.0, 'iterations': 100, 'optimization': ... 100 0.375517 0.375517 0.373395 0.373395 0.380577 0.380577 0.007819 4.092894e-04 0.003013 0.003013
133 0.259383 0.002841 0.178473 0.178473 10 100 bfgs L2 {'C': 10.0, 'iterations': 100, 'optimization':... 130 0.175571 0.175571 0.177639 0.177639 0.182209 0.182209 0.003479 2.363595e-04 0.002773 0.002773

These are roughly the same value for small values of C (mean_test_score ~= 0.44) but it's hard to compare because sklearn has a different C for each class, and we are only using one.

In [152]:
print("The most accurate model was:", clf.best_score_)
print("was found using the following parameters:")
print(clf.best_estimator_.get_params())
The most accurate model was: 0.447497756231
was found using the following parameters:
{'C': 0.10000000000000001, 'eta': 0.1, 'iterations': 100, 'optimization': 'full', 'regularization': 'L1'}

Deployment

Even though it's hard to determine which method is better. We would recommond the method from sklearn. One of the main reason is the accuracy of the predictions. At the first time when we tried to implement the logic regression, we didn't perform the PCA to reduce the features to 5. We had more than 20 features. The accuracy we can get from our method is about 45% while the one from sklearn can get about 56%.

This might be because sklearn is using different value of C for each class as mentioned before, or some other implicit process they did inside their method which can be why theirs took longer time to execute. But we can assume their method can perform better with a larger number of features than ours.

In [176]:
x1 = stats 

x1 = preprocessing.scale(x1, axis=1) # with 21 features
x1.shape
Out[176]:
(18380, 21)
In [177]:
X_train, X_test, y_train, y_test = train_test_split(x1, y, test_size=0.2)
In [178]:
%%time
#execute with our method:
# we  use the parameters that performs the best from above. 
lrc = LRClassifier(eta=0.1, C = 0.10000000000000001, iterations = 100, optimization = 'full', regularization = 'L1') 
lrc.fit(X_train, y_train)
yhat = lrc.predict(X_test)
score = accuracy_score(y_test,yhat+1)
print("The accuracy:", score)
The accuracy: 0.475516866159
Wall time: 771 ms
In [180]:
%%time 

lr_sk = SKLogisticRegression() # all params default

lr_sk.fit(X_train,y_train.ravel())
yhat = lr_sk.predict(X_test)

score = accuracy_score(y_test.ravel(),yhat)

print("The accuracy:", score)
The accuracy: 0.558215451578
Wall time: 4.9 s